This week we are learning how to cluster and classificate. First we need to load the Boston-data, from the MASS package.
library(MASS)
data("Boston")
#Then we explore the structure and the dimensions of the data
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506 14
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
The data has 506 objects and 14 variables.This dataset is about housing values in suburbs of Boston. The variables are shortened, so for us to understand what they mean I have explainde the variables below. Variables: “Crim” means per capita crime rate by town. “zn” means proportion of residential land zoned for lots over 25,000 sq.ft. “indus” means proportion of non-retail business acres per town. “chas” means Charles River dummy variable (= 1 if tract bounds river; 0 otherwise). “nox” means nitrogen oxides concentration (parts per 10 million) “rm” means average number of rooms per dwelling “age” means proportion of owner-occupied units built prior to 1940 “dis” means weighted mean of distances to five Boston employment centres “rad” means index of accessibility to radial highways “tax” means full-value property-tax rate per $10,000. “ptratio” means pupil-teacher ratio by town “black” means 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town “lstat” means lower status of the population (percent) “medv” means median value of owner-occupied homes in $1000s
First I downlond few packages I could need during the excercise.
library(ggplot2)
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(tidyverse)
## ── Attaching packages ──────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ tibble 2.1.3 ✔ purrr 0.3.3
## ✔ tidyr 1.0.0 ✔ dplyr 0.8.3
## ✔ readr 1.3.1 ✔ stringr 1.4.0
## ✔ tibble 2.1.3 ✔ forcats 0.4.0
## ── Conflicts ─────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ dplyr::select() masks MASS::select()
Next we are going to look at the summary of the Boston dataset
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
If you look at the variables more closely, you can see that almost all the variables are normally distributed. You can know this by seeing that median and mean values are more or less close to each other. Variables “Zn” and “Crim” are not normally distributed. The variable “Chas” is a binary variable.
Next were are going to look for the correlations of the dataset.
#First a picture of the graphical overview
pairs(Boston)
This picture is not very clear. It is a picture of the pairs plot. All the blots are so small and so near to each other thatyou can actually see only black picture. From this picture you cannot see the correlations so next we are going to look at the correlations more closely.
#Let's make a correlation matrix and draw a correlation plot
cormatrix <- cor(Boston)
cormatrix %>% round(digits = 2)
## crim zn indus chas nox rm age dis rad tax
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47
## ptratio black lstat medv
## crim 0.29 -0.39 0.46 -0.39
## zn -0.39 0.18 -0.41 0.36
## indus 0.38 -0.36 0.60 -0.48
## chas -0.12 0.05 -0.05 0.18
## nox 0.19 -0.38 0.59 -0.43
## rm -0.36 0.13 -0.61 0.70
## age 0.26 -0.27 0.60 -0.38
## dis -0.23 0.29 -0.50 0.25
## rad 0.46 -0.44 0.49 -0.38
## tax 0.46 -0.44 0.54 -0.47
## ptratio 1.00 -0.18 0.37 -0.51
## black -0.18 1.00 -0.37 0.33
## lstat 0.37 -0.37 1.00 -0.74
## medv -0.51 0.33 -0.74 1.00
library(corrplot)
## corrplot 0.84 loaded
corrplot(cormatrix, method = "circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)
My R couldn’t find a library corrplot, even tough that was instructed in the datacamp. Finally I found a package of corrplot. So after downloading that I could look at the correlations in this picture, which is more clear to me. There is strong negative correlation (big red ball), between dis-nox, dis-age and dis-indus. Meaning that moving furher from Boston employment centers the Nitrogen oxide concentration goes down, the proportion of owner-occupied units built prior to 1940 goes down. This seems clear and logical. Also lower status of the population (lstat) and median value of owner-occupied homes (medv) have strong neagtive correlation. When the percent of lower status of the population gets bigger the median value of owner-occupied homes in $1000s gets smaller. This also is logical. A positive correlation is marked with a big blue ball. So if you look at the picture, you can see that rad and tax have a strong postive correlation. This means that when the index of accessibility to radial highways rises also the full-value property-tax rate per $10,000 rises.
Standardizing the dataset
boston_scaled <- scale(Boston)
summary(boston_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
# The boston_scaled is a matrix and we are going to change it to a data for the future
class(boston_scaled)
## [1] "matrix"
boston_scaled <- as.data.frame(boston_scaled)
After the data has been scaled you can see from the summary that all the means and medians are close to each other meaning that now they are normally distributed. This will help us in scaling this dat in a fitted model. Next we are going to change the continuous crime rate variable into a categorical variable. We need to cut the crim variable by quantiles to get the high, low and middle rates of crime into their own categories. Then we are going to drop the old crim variable from the dataset and replace it with the new crime variable.
# Creating a quantile vector
bins <- quantile(boston_scaled$crim)
bins
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
# creating a categorical variable crime
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, label = c("low", "med_low", "med_high", "high"))
table(crime)
## crime
## low med_low med_high high
## 127 126 126 127
summary(boston_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
library(dplyr)
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
summary(boston_scaled)
## zn indus chas nox
## Min. :-0.48724 Min. :-1.5563 Min. :-0.2723 Min. :-1.4644
## 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723 1st Qu.:-0.9121
## Median :-0.48724 Median :-0.2109 Median :-0.2723 Median :-0.1441
## Mean : 0.00000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723 3rd Qu.: 0.5981
## Max. : 3.80047 Max. : 2.4202 Max. : 3.6648 Max. : 2.7296
## rm age dis rad
## Min. :-3.8764 Min. :-2.3331 Min. :-1.2658 Min. :-0.9819
## 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049 1st Qu.:-0.6373
## Median :-0.1084 Median : 0.3171 Median :-0.2790 Median :-0.5225
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617 3rd Qu.: 1.6596
## Max. : 3.5515 Max. : 1.1164 Max. : 3.9566 Max. : 1.6596
## tax ptratio black lstat
## Min. :-1.3127 Min. :-2.7047 Min. :-3.9033 Min. :-1.5296
## 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049 1st Qu.:-0.7986
## Median :-0.4642 Median : 0.2746 Median : 0.3808 Median :-0.1811
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332 3rd Qu.: 0.6024
## Max. : 1.7964 Max. : 1.6372 Max. : 0.4406 Max. : 3.5453
## medv crime
## Min. :-1.9063 low :127
## 1st Qu.:-0.5989 med_low :126
## Median :-0.1449 med_high:126
## Mean : 0.0000 high :127
## 3rd Qu.: 0.2683
## Max. : 2.9865
dim(boston_scaled)
## [1] 506 14
The data has still 506 objects and 14 variables. The instructions said that we need to divide the dataset to train and test sets, so that only 80% of the data belongs to the train set.
# We are going to make the train and the test sets by choosing 80% observations to the train set and the rest of the observations are in the test set.
n <- nrow(boston_scaled)
n
## [1] 506
ind <- sample(n, size = n * 0.8)
dim(ind)
## NULL
train <- boston_scaled[ind, ]
str(train)
## 'data.frame': 404 obs. of 14 variables:
## $ zn : num -0.487 -0.487 -0.487 -0.487 -0.487 ...
## $ indus : num 1.015 -0.548 1.015 -0.72 -1.033 ...
## $ chas : num -0.272 -0.272 -0.272 -0.272 -0.272 ...
## $ nox : num 1.366 -0.532 1.366 -0.412 -0.386 ...
## $ rm : num 1.577 -0.821 0.593 2.921 0.819 ...
## $ age : num 1.092 0.203 0.761 0.168 0.207 ...
## $ dis : num -0.6375 0.4398 -0.5687 0.0206 -0.4178 ...
## $ rad : num 1.66 -0.522 1.66 -0.178 -0.522 ...
## $ tax : num 1.529 -0.719 1.529 -0.601 -0.666 ...
## $ ptratio: num 0.806 0.529 0.806 -0.488 -0.857 ...
## $ black : num 0.21 0.377 -1.111 0.32 0.379 ...
## $ lstat : num 0.572 -0.128 0.528 -1.426 -0.803 ...
## $ medv : num -0.515 -0.438 -0.667 2.084 0.801 ...
## $ crime : Factor w/ 4 levels "low","med_low",..: 4 3 4 3 1 1 3 4 3 4 ...
So now the train set had 404 observations and 14 variables. This is 80% of the 506 observations we began with.
Next we are creating the test set.
test <- boston_scaled[-ind, ]
str(test)
## 'data.frame': 102 obs. of 14 variables:
## $ zn : num -0.4872 0.0487 0.0487 -0.4872 -0.4872 ...
## $ indus : num -0.593 -0.476 -0.476 -0.437 -0.437 ...
## $ chas : num -0.272 -0.272 -0.272 -0.272 -0.272 ...
## $ nox : num -0.74 -0.265 -0.265 -0.144 -0.144 ...
## $ rm : num 0.194 -0.388 -0.399 -0.478 -0.455 ...
## $ age : num 0.3668 -0.0702 0.6155 -0.2407 0.7327 ...
## $ dis : num 0.557 0.838 1.328 0.433 0.103 ...
## $ rad : num -0.867 -0.522 -0.522 -0.637 -0.637 ...
## $ tax : num -0.986 -0.577 -0.577 -0.601 -0.601 ...
## $ ptratio: num -0.303 -1.504 -1.504 1.175 1.175 ...
## $ black : num 0.441 0.426 0.329 0.441 0.393 ...
## $ lstat : num -0.492 -0.0312 0.6227 -0.6152 0.1648 ...
## $ medv : num -0.1014 0.0399 -0.395 -0.2319 -0.3189 ...
## $ crime : Factor w/ 4 levels "low","med_low",..: 1 2 2 3 3 3 3 3 3 3 ...
The test set has 102 observations and 14 variables. This is 20% of the original data set.
We are using the categorical crime rate as the target variable and all the other variables in the dataset as predictor variables. We are going to draw the LDA (bi)plot of the model.
#Fitting the linear discriminant analysis on the train set
lda.fit <- lda(formula = crime ~ ., data = train)
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
#Target classes as numeric
classes <- as.numeric(train$crime)
#Drawing a plot of the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)
#Next we are going to save the crime categories from the test set and then remove the categorical crime variable from the test dataset. After that we are going to predict the classes with the LDA model on the test data.
#We are going to save the crime categories from the test set
correct_classes <- test$crime
library(dplyr)
#Next we are going to remove the categorial crime variable from the test dataset
test <- dplyr::select(test, -crime)
# Next we are going to predict with the LDA model on the test data
lda.pred <- predict(lda.fit, newdata = test)
#Then was asked to do a cross table of the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 18 5 3 0
## med_low 6 12 5 0
## med_high 0 10 11 1
## high 0 0 0 31
Here you can see how the model is working with the predictions. The model works well when predicting the high crime rates. The model makes errors when predicting the other crime classes.
#Next we are going load the Boston dataset again and standardize the dataset. We are going to calculate the distances between the observations. We are going to run k-means algorithm on the dataset. We are going to investigate what is the optimal number of clusters and run the algorithm again. In the end we are going to visualize the clusters and interpret the results.
library(MASS)
data("Boston")
dim(Boston)
## [1] 506 14
We now have loaded the Boston dataset again from the MASS-library. We wanted to check that we have the correct amount of observations 506 and variables 14.
Next we are going to measure the distance. I am going to use the Euklidean-distance, which is probably the most common one. K-means is old and often used clustering method. K-means counts the distance matrix automatically but you have to choose the number of clusters. I made the model with 3 clusters, because my opinion is that it worked better than 4 clusters.
scale_Boston2 <- scale(Boston)
scale_Boston2 <- as.data.frame(scale_Boston2)
#Next we are going to calculate the distances, the Euklidean-distance.
dist_eu <- dist(scale_Boston2)
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
# K-means clustering
km <- kmeans(scale_Boston2, centers = 3)
#Plotting the Boston dataset with clusters
pairs(scale_Boston2, col = km$cluster)
pairs(scale_Boston2[1:6], col = km$cluster)
pairs(scale_Boston2[7:13], col = km$cluster)
Next were are going to investigate what is the optimal number of clusters. There are many ways to find out the optimal number of clusters, but we will use the Total of within cluster sum of squares (WCSS) and visualise the result with a plot.
# First determine the number of clusters
k_max <- 10
# Calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(scale_Boston2, k)$tot.withinss})
# Next we are going to visualize the results
qplot(x = 1:k_max, y = twcss, geom = "line")
The optimal number of clusters is when the total WCSS drops radically. As you can see from the picture above this happens around, when x= 2. So the optimal number of clusters would be 2. Next we run the algorithm again with two clusters.
km <- kmeans(scale_Boston2, centers = 2)
pairs(scale_Boston2, col = km$cluster)
In this first picture all the variables are included. Because there are so many variables I think the picture is quite difficult to interpret and understand. That is why I choose to do two more plots, so that I can look at the effects more closely.
pairs(scale_Boston2[1:8], col = km$cluster)
pairs(scale_Boston2[6:13], col = km$cluster)
As you can see from the pictures above the variable “chas” doesn’t follow any pattern with any of the variables. These kind of pictures are hard for me to understand because I am doing this the very first time. I think, however, that there might be negative correlation between indus-dis, nox-dis, dis-lstat and positive correlation between indus-nox, age-nox, age-lstat.